SOCIAL NETWORK ANALYSIS - Huawei Technologies Co.

Group 4 - Assignment 2

By: Alya Al-Naimi, Maggie Hu, Rani Lottey, Mario Godinez and Oliver John Cabasan

Abstract

This study applies social network analysis (SNA) to Huawei Technologies Co.'s Facebook posts and comments to identify potential influencers that Huawei can use in its marketing campaigns. Tools such as Gephi and Python are used to create the networks, and key SNA metrics are used in conjunction with them to rank the nodes within the simulated networks so that the top influencers can be identified. The networks are presented in a data-driven online application, built from Gephi, that shows the nodes and communities Huawei can use to drive marketing campaigns.

1. Introduction

Social media for marketing has become an essential part of any business’ digital marketing strategy because it offers a wide range of opportunities for engagement with potential customers. Social networks like Facebook, Instagram, YouTube, and Twitter are the main platforms used in social media marketing to build relationships with followers by providing them valuable content that they’re interested in.

Nowadays, businesses use social media in a myriad of different ways. For example, a business that is concerned about what people are saying about its brand would monitor social media conversations and respond to relevant mentions (social media listening and engagement). A business that wants to understand how it is performing on social media would analyze its reach, engagement, and sales on social media with an analytics tool (social media analytics). A business that wants to reach a specific audience at scale would run highly targeted social media ads (social media advertising). Taken together, these activities are often known as social media management.

So, why is social media marketing important? Social media marketing is all about connecting with an audience or customers and helping them understand a brand better. Compelling content published on social media channels reaches many more people than content posted only on a website or blog. Used effectively, social media marketing includes customers in marketing campaigns, which makes them feel like part of the business, increases their brand loyalty, and helps the business establish a loyal customer base.

Influencers on social media have a strong hold over customers and affect the buying and selling decisions of online customers. By featuring influencers in its content and adding their opinions to its posts, a business can gain popularity and credibility, and building credibility is the goal of using social media effectively to market a business or product. Like any marketing tactic, an influencer program takes deliberate targeting and planning.

1.1 Background

Huawei Technologies Co. Ltd. is a Chinese multinational networking, telecommunications equipment, and services company and is the largest telecommunications equipment manufacturer in the world. Huawei has 21 R&D institutes all over the world and in 2014, the company invested 6.4 billion USD in R&D, up from 5 billion USD in 2013. From July to September 2017, Huawei surpassed Apple and became the second largest smartphone manufacturer in the world after Samsung.

Huawei has a strong international presence and maintains social media profiles on all the major networks including Facebook, Twitter, LinkedIn and Instagram. They use social media marketing and social network analysis tools and techniques to enhance their business position, increase market share, drive engagements and increase revenue.

Social network analysis is an analytical tool that can be used to map and measure social relations. It characterizes networked structures in terms of nodes (individual actors, people, or things within the network) and the ties, edges, or links (relationships or interactions) that connect them. Through quantitative metrics and robust visual displays, social network analysis provides a systematic approach for investigating large amounts of data on people and relationships. Using this data-driven approach, social media pages such as Huawei’s Facebook page can be analyzed to understand networks and the individuals and groups within them.

1.2 Business Objective and Dataset

Business Problem Statement

The main objective of this study is to use social network analysis (SNA) to systematically understand Huawei’s Facebook networks and to use network metrics to identify the most important or central individuals in the network. It stands to reason that these most central individuals are potential Huawei Facebook influencers who can be engaged in social media marketing tactics to raise the profile of Huawei and its products, including new product launches.

Dataset

The dataset used in this study is a publicly available dataset created on May 1, 2018 and can be found at https://www.kaggle.com/andrewlucci/huawei-social-network-data/metadata; the timeframe of the data collection is not known. Data was collected by crawling Huawei’s Facebook pages, and web crawler APIs were used to extract the Facebook posts and comments. After data pre-processing, natural language processing techniques were used to extract only positive reviews from the posts. The Huawei Facebook data is directed and labeled, with 1,000 nodes and 250,315 edges stored in 1,000 rows and 1,000 columns of data.

Directed data is defined by a set of nodes (N) and a set of edges (E), where the elements of E are ordered pairs of nodes in N. Labeled data is data that has been tagged with one or more labels identifying certain properties or characteristics; labels make that data specifically useful in machine learning.
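The definition above can be illustrated with a small sketch. It assumes that the 1,000 x 1,000 table is an adjacency matrix in which a non-zero entry in row i, column j means a directed edge from node i to node j; the report does not state this explicitly, and the labels and values below are hypothetical.

```python
# Toy 4 x 4 matrix standing in for the 1,000 x 1,000 Huawei table.
# Assumption: a non-zero entry at [i][j] is a directed edge from node i to node j.
names = ["Ann", "Bob", "Cat", "Dan"]  # hypothetical node labels
adjacency = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]


def matrix_to_edges(labels, matrix):
    """Return the edge set E as ordered (source, target) pairs."""
    return [
        (labels[i], labels[j])
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    ]


edges = matrix_to_edges(names, adjacency)

# Out-degree counts edges leaving a node; in-degree counts edges arriving at it.
out_degree = {n: sum(1 for source, _ in edges if source == n) for n in names}
in_degree = {n: sum(1 for _, target in edges if target == n) for n in names}

print(edges)
print("in-degree:", in_degree)
print("out-degree:", out_degree)
```

Because the pairs are ordered, ("Ann", "Bob") being in the edge list does not imply that ("Bob", "Ann") is, which is what distinguishes a directed network from an undirected one.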

1.3 Assumptions and Limitations

While social network analysis can be a powerful tool, it has its own set of limitations and assumptions. The list below gives just a few to consider while conducting this analysis:

1.4 Aletheia Framework

The Aletheia Framework, a new standard of ethics and trustworthiness in artificial intelligence, has been attached for further details. It covers 32 facets across societal impact, governance and trust, and transparency, and requires executives and boards to provide evidence that each has been rigorously considered. Some of the basic information from this framework is summarized as follows:

2. Methodology

Within the social network, influencers and opinion leaders stand out and gain interest from a marketing perspective because they have the potential to influence buying behavior both in their first-order contacts and in the rest of their network. The key step, therefore, is to determine the relevant metrics that identify these influencers and so address the main problem of this study.

SNA provides several methods that can be used to describe and weigh different characteristics of the network in general, the individuals that make it up, and the connections or links between these individuals. This study proposes to identify these individuals using the following key metrics:

  1. Degree Centrality - number of connections
  2. Closeness Centrality - closeness to the entire network
  3. Betweenness Centrality - nodes that act as bridges between other nodes
  4. Eigenvector Centrality - connectedness to well-connected nodes
  5. Pagerank - very similar to Eigenvector Centrality, except that it focuses more on in-degree as the main measure of influence
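To make the first two definitions concrete, here is a minimal sketch that computes degree and closeness centrality on a hypothetical five-node undirected network using only the Python standard library. It is an illustration of the standard definitions, not the computation used in this study, where the metrics come from Gephi; libraries such as NetworkX also provide all five metrics.

```python
from collections import deque

# Hypothetical undirected toy network: path A-B-C-D plus an extra edge C-E.
graph = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D", "E"},
    "D": {"C"},
    "E": {"C"},
}


def degree_centrality(g):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(g)
    return {node: len(neighbours) / (n - 1) for node, neighbours in g.items()}


def closeness_centrality(g):
    """(n - 1) divided by the sum of shortest-path distances to all other nodes.

    Assumes the graph is connected, so every node is reachable.
    """
    n = len(g)
    result = {}
    for source in g:
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for neighbour in g[current]:
                if neighbour not in distance:
                    distance[neighbour] = distance[current] + 1
                    queue.append(neighbour)
        result[source] = (n - 1) / sum(distance.values())
    return result


degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
print(degree)
print(closeness)
```

In this toy network node C has both the most connections and the shortest average distance to everyone else, so it ranks first on both metrics.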

Among these metrics, the key guideline in identifying a potential influencer is by using a 2-dimensional matrix of Betweenness Centrality and Eigenvector centrality through a scatter plot which is an adaptation of the matrix model presented by Scoponi et al. (2016) to classify the actors of a social network in terms of their level of influence. The graph between these two metrics shows a correlation as these two are basically complementary metrics. Members of the network that simultaneously meet the highest values of both Betweenness and Eigenvector Centrality should be classified as potential influencers as shown in the figure below.

image-2.png

This matrix represents a two-dimensional scatter plot in which the individual components of a network are plotted according to their betweenness centrality (x-axis) and their eigenvector centrality (y-axis). Thresholds are then set in each dimension according to a relevant criterion, dividing the plane into quadrants so that every actor can be classified into one of four groups:

  1. Potential influencers with a high degree of betweenness and eigenvector centrality.
  2. Brokers or individuals with high betweenness centrality and low eigenvector centrality.
  3. Actors with important connections, their low score in betweenness centrality suggesting a limited outreach to groups outside their local community; and
  4. Secondary actors.
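The four groups above can be sketched as a simple classification rule. The scores below are hypothetical, and using the median of each metric as the threshold is an assumption made for illustration; the "relevant criterion" for the thresholds is left open in the text.

```python
from statistics import median

# Hypothetical (betweenness, eigenvector) scores per user.
scores = {
    "u1": (0.30, 0.90),
    "u2": (0.25, 0.10),
    "u3": (0.02, 0.80),
    "u4": (0.01, 0.05),
    "u5": (0.05, 0.20),
}


def classify(scores, x_cut=None, y_cut=None):
    """Assign each actor to one of the four quadrant groups.

    Assumption: thresholds default to the median of each metric.
    """
    if x_cut is None:
        x_cut = median(b for b, _ in scores.values())
    if y_cut is None:
        y_cut = median(e for _, e in scores.values())
    labels = {}
    for user, (betweenness, eigenvector) in scores.items():
        high_b = betweenness > x_cut
        high_e = eigenvector > y_cut
        if high_b and high_e:
            labels[user] = "potential influencer"
        elif high_b:
            labels[user] = "broker"
        elif high_e:
            labels[user] = "important connections"
        else:
            labels[user] = "secondary actor"
    return labels


groups = classify(scores)
for user, group in groups.items():
    print(user, "->", group)
```

Passing explicit `x_cut` and `y_cut` values makes it easy to test how sensitive the list of potential influencers is to the choice of thresholds.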

Similarly, the primary focus of this study is to identify the first group: the potential influencers, who belong to the quadrant with high Betweenness and high Eigenvector Centrality. To improve on this two-dimensional method, this study also factors in Degree Centrality, Pagerank and Closeness Centrality in addition to Eigenvector and Betweenness Centrality. The key influencers are selected based on their ranking across all of these key metrics.

Although both Betweenness and Eigenvector Centrality can greatly impact the scale of Huawei's marketing campaigns, it is also important to know which individuals have the most direct connections or are the most active in the network. Degree Centrality helps here as a preliminary step: it identifies the individuals who are popular, as reflected in their number of connections.

For a marketing campaign under budgetary constraints, the speed at which knowledge or word of mouth is transferred between consumers is also important. Speed is greatly facilitated by the closeness of the individuals in the network, because information and influence are easily passed on when individuals are directly linked to each other. This is why Closeness Centrality is also considered a key metric in this study. It identifies the individuals who can reach the other nodes in the network more quickly than anyone else, because they have the shortest paths to all others.

2.1 Tools

The following tools are used for this study:

  1. Gephi – evaluation tool
  2. Python NetworkX – evaluation tool
  3. GitHub – main repository
  4. SigmaJS Explorer (Gephi Plug-in) – tool for the deployment

2.2 Data Extraction from Gephi

The initial analysis is extracted from Gephi, which can conveniently compute all the network metrics at once after the dataset has been loaded. The exported metrics are then fed into a Python environment for further data manipulation and exploration to draw out the potential key influencers.

Two datasets were extracted from Gephi:

  1. The full Facebook dataset containing 100% nodes and edges,
  2. The shortlisted Facebook dataset containing only the nodes at or above the 75th percentile (top 25%) by number of connections (degree).

The second dataset is intended for analyzing a shortlist containing only the top 25% of individuals, those with very high degrees (connections) in the full network. The threshold can, however, be adjusted according to the purpose of the analysis, the size and characteristics of the network, and the operational and budgetary constraints a company like Huawei would face if it decided to launch a demand-generating campaign that aims for maximum reach through a minimum number of individuals. Additionally, the second dataset allows this study to identify key individuals who rank consistently high in all key metrics, regardless of whether the full dataset or the top 25% dataset is used.

2.3 Data Preparation Steps

  1. Extract the full dataset from Gephi. The export should include all the metrics computed by Gephi and is generated as a CSV file by the 'Export Table' function in the 'Data Laboratory' section of Gephi.
  2. Extract the shortlisted top 25% dataset from Gephi using this method:

    • Load the full dataset into Python and determine the degree value at the 75th percentile, which is the cut-off for the top 25% of the original list. Based on the describe function in Python, this cut-off is 214 degrees (see table below).

    • Use this degree value (214) in the degree range filter of Gephi to extract the top 25%.

  3. Clean both datasets using Python.
  4. Explore the datasets.

Top25.png
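The cut-off step can be sketched as follows. The degree values are hypothetical and were chosen so that the result matches the 214 reported above. The sketch uses the standard library's `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between data points and should therefore agree with the 75% row of the pandas describe output.

```python
from statistics import quantiles

# Hypothetical degree values; in the study these come from the Gephi export.
degrees = [12, 40, 75, 130, 180, 200, 214, 260, 300]


def top_quartile_cutoff(values):
    """Return the 75th percentile (third quartile) of the values.

    method="inclusive" interpolates linearly between the sorted data points.
    """
    return quantiles(values, n=4, method="inclusive")[2]


cutoff = top_quartile_cutoff(degrees)

# Nodes at or above the cut-off form the shortlisted (top 25%) dataset.
shortlist = [d for d in degrees if d >= cutoff]

print("cut-off:", cutoff)
print("shortlist:", shortlist)
```

The same cut-off value is what would be entered in Gephi's degree range filter in the second bullet of step 2.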

2.4 Evaluation Steps Summary

  1. Create a separate dataset for each metric and sort each one according to the users' ranking on that metric. As an output, this creates a pandas data frame for each metric.

  2. From the metrics dataframe, create and plot a matrix to represent a two-dimensional scatter plot in which the individual components of a network are plotted according to their betweenness centrality (x-axis) and their eigenvector centrality (y-axis).

  3. Create and plot a second two-dimensional matrix comparing Pagerank vs Betweenness Centrality.

  4. Create a summary of the total ranking for the full network and for the top 25% network, using a Word Cloud graph to bring out the top influencers for each.
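Steps 1 and 4 can be sketched without pandas as follows. The users, column names and values are hypothetical, and summing rank positions across metrics is only one possible way to build a "total ranking"; it is an assumption, not necessarily the method used in this study.

```python
# Hypothetical metric values per user (names and numbers are illustrative only).
rows = [
    {"user": "user_a", "degree": 410, "betweenness": 0.031, "eigenvector": 0.98},
    {"user": "user_b", "degree": 395, "betweenness": 0.044, "eigenvector": 0.91},
    {"user": "user_c", "degree": 120, "betweenness": 0.002, "eigenvector": 0.35},
    {"user": "user_d", "degree": 402, "betweenness": 0.027, "eigenvector": 1.00},
]
METRICS = ["degree", "betweenness", "eigenvector"]


def rank_by_metric(rows, metric):
    """Return user names sorted from the highest to the lowest value of one metric."""
    ordered = sorted(rows, key=lambda row: row[metric], reverse=True)
    return [row["user"] for row in ordered]


def total_rank(rows, metrics):
    """Sum each user's rank position (1 = best) over all metrics; lower is better."""
    totals = {row["user"]: 0 for row in rows}
    for metric in metrics:
        for position, user in enumerate(rank_by_metric(rows, metric), start=1):
            totals[user] += position
    return sorted(totals.items(), key=lambda item: item[1])


per_metric = {metric: rank_by_metric(rows, metric) for metric in METRICS}
summary = total_rank(rows, METRICS)
print(per_metric)
print(summary)
```

A user who is near the top of every per-metric list ends up with a low total, which is the kind of consistently high-ranking individual the methodology looks for.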

Creating the WordCloud:

5 Metrics Validation:

3. Data Understanding

3.1 Initial Dataset Import and Configuration

The required libraries are imported and the Pandas version is checked to verify that the latest version is being used.

A user-defined function called resumetable is created to simplify the analysis of the dataframes; it consolidates various commands used for the initial data exploration.

Import Datasets

The dataset is loaded into Python and resumetable is run to get information on it. As shown below, the dataset has 1,000 columns and 1,000 records, and there are no missing values.

3.2 Data Extraction using Gephi

Using Gephi, two CSV data files are generated by the 'Export Table' function in the 'Data Laboratory' section of Gephi. The two datasets extracted from Gephi are as follows:

  1. The full Facebook dataset containing 100% of the nodes and edges, and
  2. A shortlisted Facebook dataset containing only the top 25% of nodes by number of connections (degree).

Details on the files generated are noted below:

The two data files from Gephi were read into dataframes using the nomenclature noted below. In addition, all the graph statistics computed by Gephi were imported into Python so that they could be used for further analysis.

3.3 Cleaning of the Two Datasets

Unused columns are removed from both dataframes and the indices are reset.

The non-breaking space character (\xa0) is removed from the user IDs in both datasets.
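A minimal sketch of this cleaning step is given below. The raw IDs are hypothetical, and the helper name `clean_user_id` is illustrative rather than taken from the study's notebook.

```python
def clean_user_id(raw):
    """Remove non-breaking spaces (U+00A0) and surrounding whitespace from an ID."""
    return raw.replace("\xa0", "").strip()


# Hypothetical raw IDs as they might appear in the exported CSV.
raw_ids = ["user_a\xa0", "\xa0user_b", " user_c", "user_d"]
clean_ids = [clean_user_id(value) for value in raw_ids]
print(clean_ids)
```

Cleaning the IDs before ranking matters because "user_a" and "user_a\xa0" would otherwise be treated as two different users when the datasets are compared.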

3.4 Exploration of the Cleaned Datasets

With the cleaned datasets, further initial exploration can be conducted to obtain some basic statistics and correlations.

Heatmaps of Metrics for the Cleaned Datasets

The heatmaps for both datasets show that Eccentricity and Modularity Class are the only metrics with a low correlation to the others; all other metrics are highly correlated, which is expected.
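The values behind such a heatmap are pairwise correlation coefficients. The sketch below computes a small Pearson correlation matrix by hand on hypothetical metric columns; it assumes Pearson correlation, which is the pandas default, although the report does not name the method used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


# Hypothetical metric columns for five nodes.
columns = {
    "degree": [10, 20, 30, 40, 50],
    "pagerank": [0.1, 0.2, 0.3, 0.4, 0.5],
    "eccentricity": [3, 2, 3, 2, 3],
}

# Nested dictionary holding the full correlation matrix.
corr = {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}

for name, row in corr.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

In this toy data degree and pagerank move together, while eccentricity is unrelated to both, mirroring the pattern described for the heatmaps.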

Distribution of Nodes By Communities

The networks created in Gephi had nine distinct communities; the distribution of the nodes within each community is shown for each dataset in the plots below. It is interesting to note that the distribution of the top 25% dataset is much more even than that of the full network dataset, which shows 3-4 communities with a higher count of nodes. This is expected, since the top 25% dataset has already been filtered to keep the nodes with the highest number of connections (degree).

Distribution of Nodes By Degree Value

The distribution of the nodes by degree count is shown below for both datasets. In both cases the distribution is approximately bell-shaped, which suggests that further fitting or filtering is not required.

3.5 Analysis of the Datasets using Various Metrics

Separate datasets were created from the cleaned full dataset (clean_gdf_f) and the top 25% dataset (clean_gdf_c) to analyze the following commonly used metrics in network analysis:

Looking at the highest-ranking results, a number of names occur frequently, e.g. Zack, Ishku, Homer, Ernie, Rawail, etc. Further analysis can be carried out using scatterplots of some of the metrics.

3.6 Scatterplots of Some Key Metrics

Pagerank Centrality vs Betweenness Centrality Scatterplot

The individual components of the network are plotted in a scatterplot according to their betweenness centrality (x-axis) and their Pagerank centrality (y-axis), as shown below. The plot shows a correlation between these two metrics, as they are basically complementary. Members shown in the top right-hand corner of the graph have the highest values of both Betweenness and Pagerank Centrality and can be classified as potential influencers. This quadrant of the plot is of the most interest for this study.

Note, however, that these metrics provide only an initial list of potential influencers; this list can be improved upon by including the other metrics previously listed, i.e. Degree Centrality, Closeness Centrality and Eigenvector Centrality.

Eigenvector Centrality vs Betweenness Centrality Scatterplot

Similar to the plot above, another scatterplot can be created with Eigenvector Centrality and Betweenness Centrality, as shown below. The results in the top right-hand quadrant are very similar to the previous plot, since Eigenvector Centrality and Pagerank are similar metrics, except that Pagerank uses in-degree more heavily to compute influence in the network.

3.7 Comparison of Datasets using WordCloud

As another analysis of the two datasets, a WordCloud was used to illustrate and explore how the top members differ when the full network dataset is used compared to the top 25% dataset with the highly connected members.

The WordClouds use the aggregated top-ranked nodes from each metric for each of the datasets.

The table below shows the top ranked members for the full network dataset for each metric.

The table below shows the top ranked members for the top 25% dataset for each metric.

Creating the WordCloud

The text variable for the WordCloud is created by concatenating all the columns.
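The concatenation step can be sketched as follows, using a hypothetical table of top-ranked users with one column per metric. A word cloud typically sizes each word by how often it appears in the text, so counting the words shows which names would be drawn largest.

```python
from collections import Counter

# Hypothetical table of top-ranked users: one column per metric.
top_ranked = {
    "degree": ["user_a", "user_b", "user_c"],
    "betweenness": ["user_b", "user_a", "user_d"],
    "eigenvector": ["user_a", "user_c", "user_b"],
}


def build_text(table):
    """Concatenate every column into one space-separated string."""
    return " ".join(name for column in table.values() for name in column)


text = build_text(top_ranked)

# Word frequencies: names that rank highly on several metrics appear more often.
frequencies = Counter(text.split())
print(text)
print(frequencies.most_common())
```

Names that contain spaces would be split into separate words by this approach, so IDs may need to be joined (for example with underscores) before the text is built.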

Generated WordClouds

The two WordClouds are generated below, the first from the full network dataset and the second from the top 25% dataset. As expected, the WordClouds show different members as the most important. This is of interest because the choice of dataset can greatly affect who would be considered members of interest, i.e. influencers. Further analysis is required to understand the effects of the dataset chosen.

4. Web Application

The following tools were used to create an interactive visual web application:

4.1 Web App Deployment

The web app deployment steps are noted below:

  1. Install plugins: install all the necessary plug-ins.
  2. Import graph
    • Open GEXF file in Gephi (File->Open...)
  3. Compute layout
    • Use Multigravity ForceAtlas 2 as the layout. The main benefit of this algorithm is that graphs can be coerced into certain shapes based on some data related to the data set.
    • Parameters:
      • Scaling: 10
      • Edge Weight Influence: 0
      • Dissuade hubs: True
  4. Export SigmaJS template
    • Export the SigmaJS template (File->Export->SigmaJS template) and fill in all required fields.
  5. Test locally
    • Go to the exported folder and start a simple Python HTTP server to test the visualization. Depending on the Python version, type one of the following commands in the terminal:
      • Python 3.x: python -m http.server
      • Python 2.7: python -m SimpleHTTPServer
    • Using a web browser (works best in Chrome), go to http://localhost:8000/ to interact with the graph.
  6. Compute attributes
    • Attribute-based layouts can be used to enhance readability. Before applying them, first compute the attributes:
      • Modularity (Use weights: False)
      • Average Degree
  7. Color nodes according to their community
    • Color nodes according to their modularity class and make their size correspond to their degree.
  8. Circle Pack layout
    • Use the Circle Pack layout to rearrange nodes according to attributes. Use modularity and degree as parameters.
  9. Scale and labels
    • Adjust the scale and labels. Use the Expansion layout to increase the scale of the layout. Display labels, reduce the font size and use the Label Adjust layout to prevent overlapping node labels.
  10. Export the SigmaJS template once again and test it locally.
  11. Publish the graph on GitHub Pages
    • Go to GitHub Pages.
    • Go to the settings of the repository and find the GitHub Pages section. Specify the master branch as the source.
    • Clone the repository.
    • Copy the exported SigmaJS template prepared above to the cloned folder.
    • Push the files to the repository.
    • Check the website with the interactive visualization at https://[YOUR-GITHUB-USER-NAME].github.io/[VIS-REPOSITORY-NAME]/network/
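For the "Test locally" step, the same static file server can also be started from a short Python 3 script instead of the command line. This is a sketch only: the helper name `make_server` and the folder name "network" are hypothetical placeholders for the folder exported by the SigmaJS template.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory, port=8000, host="127.0.0.1"):
    """Build an HTTP server that serves static files from `directory`."""
    # The `directory` argument tells the handler which folder to serve.
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer((host, port), handler)


# Example use (left commented out because serve_forever() blocks):
# server = make_server("network")  # "network" is a placeholder folder name
# server.serve_forever()           # then open http://localhost:8000/
```

Serving from an explicit folder avoids having to change into the exported directory first, and passing a different `port` helps when 8000 is already in use.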

4.2 Web App Production

Full Network Dataset: https://3toneles.github.io/networkF/
Top 25% Dataset: https://3toneles.github.io/networkC214/

Full Dataset Web Deployment (fulldataset.png)

Top 25% Web Deployment (top25_app.png)

Example Network of a Node (Zack_network.png)

5. Conclusions and Recommendations

The 2-dimensional matrix of Betweenness Centrality and Eigenvector Centrality is a good start in identifying top potential influencers. Using the scatter plot, the matrix model is able to classify the actors of a social network in terms of their level of influence. Members of the network that simultaneously have the highest values of both Betweenness and Eigenvector Centrality are identified as potential influencers and can be found in Quadrant 1 of the scatter plot.

To increase the robustness of the 2-dimensional matrix method, this study proposed including other important metrics to identify key influencers. The correlation matrix (heatmap) confirmed that the 5 metrics chosen for this study are highly correlated.

From the top-ranking individuals per metric, as computed by Gephi, the WordCloud was able to summarize the names or nodes that can be considered the top-ranked potential influencers or opinion leaders across the 5 metrics proposed in the methodology of this study.

As a recommendation, more exploration is needed on the dynamics of the identified communities, of which there are 8 in the full dataset network and 7 in the shortlisted (top 25%) network. Although Gephi was able to extract and identify these groups, further study is needed on what defines a group of densely interconnected nodes that is only sparsely connected to the rest of the network. It is important that these communities are properly characterized, in addition to comparing their properties and statistics (such as node degree, clustering coefficient, betweenness centrality, etc.) against the network average. Understanding each community plays a big role in the success of every marketing campaign.